Part I - (A research on Loan data records from prosper in the United States.)¶

by (Ebenezer Mayowa PEACE)¶

Introduction¶

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.

The analysis in this worksheet/notebook would be structured to enable us provide univariate, bivariate and multivariate relationships, this research would provide answers to questions like how many homeowners borrow in a state, which year has the highest rate of borrowing homeowners, which state has the highest amount of borrowing homeowners, is there a correlation between your available bank credit and your borrowing capacity, identifying the frequency of the categorical variables; Term of loan, Borrower's employment status, year of loan, and loan status, are there differences between loans depending on the loan term?. We have 81 features/columns attributed to each record of loans in the dataset and the descriptions of the features are detailed below:

  • ListingKey Unique key for each listing, same value as the 'key' used in the listing object in the API.
  • ListingNumber The number that uniquely identifies the listing to the public as displayed on the website.
  • ListingCreationDate The date the listing was created.
  • CreditGrade The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.
  • Term The length of the loan expressed in months.
  • LoanStatus The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, ---- - PastDue. The PastDue status will be accompanied by a delinquency bucket.
  • ClosedDate Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses.
  • BorrowerAPR The Borrower's Annual Percentage Rate (APR) for the loan.
  • BorrowerRate The Borrower's interest rate for this loan.
  • LenderYield The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee.
  • EstimatedEffectiveYield Effective yield is equal to the borrower interest rate (i) minus the servicing fee rate, (ii) minus estimated uncollected interest on charge-offs, (iii) plus estimated collected late fees. Applicable for loans originated after July 2009.
  • EstimatedLoss Estimated loss is the estimated principal loss on charge-offs. Applicable for loans originated after July 2009.
  • EstimatedReturn The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Applicable for loans originated after July 2009.
  • ProsperRating (numeric) The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.
  • ProsperRating (Alpha) The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.
  • ProsperScore A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.
  • ListingCategory The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - --Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
  • BorrowerState The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.
  • Occupation The Occupation selected by the Borrower at the time they created the listing.
  • EmploymentStatus The employment status of the borrower at the time they posted the listing.
  • EmploymentStatusDuration The length in months of the employment status at the time the listing was created.
  • IsBorrowerHomeowner A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
  • CurrentlyInGroup Specifies whether or not the Borrower was in a group at the time the listing was created.
  • GroupKey The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation.
  • DateCreditPulled The date the credit profile was pulled.
  • CreditScoreRangeLower The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
  • CreditScoreRangeUpper The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
  • FirstRecordedCreditLine The date the first credit line was opened.
  • CurrentCreditLines Number of current credit lines at the time the credit profile was pulled.
  • OpenCreditLines Number of open credit lines at the time the credit profile was pulled.
  • TotalCreditLinespast7years Number of credit lines in the past seven years at the time the credit profile was pulled.
  • OpenRevolvingAccounts Number of open revolving accounts at the time the credit profile was pulled.
  • OpenRevolvingMonthlyPayment Monthly payment on revolving accounts at the time the credit profile was pulled.
  • InquiriesLast6Months Number of inquiries in the past six months at the time the credit profile was pulled.
  • TotalInquiries Total number of inquiries at the time the credit profile was pulled.
  • CurrentDelinquencies Number of accounts delinquent at the time the credit profile was pulled.
  • AmountDelinquent Dollars delinquent at the time the credit profile was pulled.
  • DelinquenciesLast7Years Number of delinquencies in the past 7 years at the time the credit profile was pulled.
  • PublicRecordsLast10Years Number of public records in the past 10 years at the time the credit profile was pulled.
  • PublicRecordsLast12Months Number of public records in the past 12 months at the time the credit profile was pulled.
  • RevolvingCreditBalance Dollars of revolving credit at the time the credit profile was pulled.
  • BankcardUtilization The percentage of available revolving credit that is utilized at the time the credit profile was pulled.
  • AvailableBankcardCredit The total available credit via bank card at the time the credit profile was pulled.
  • TotalTrades Number of trade lines ever opened at the time the credit profile was pulled.
  • TradesNeverDelinquent Number of trades that have never been delinquent at the time the credit profile was pulled.
  • TradesOpenedLast6Months Number of trades opened in the last 6 months at the time the credit profile was pulled.
  • DebtToIncomeRatio The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
  • IncomeRange The income range of the borrower at the time the listing was created.
  • IncomeVerifiable The borrower indicated they have the required documentation to support their income.
  • StatedMonthlyIncome The monthly income the borrower stated at the time the listing was created.
  • LoanKey Unique key for each loan. This is the same key that is used in the API.
  • TotalProsperLoans Number of Prosper loans the borrower at the time they created this listing. This value will be null if the borrower had no prior loans.
  • TotalProsperPaymentsBilled Number of on time payments the borrower made on Prosper loans at the time they created this listing. This value will be null if the borrower had no prior loans.
  • OnTimeProsperPayments Number of on time payments the borrower had made on Prosper loans at the time they created this listing. This value will be null if the borrower has no prior loans.
  • ProsperPaymentsLessThanOneMonthLate Number of payments the borrower made on Prosper loans that were less than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.
  • ProsperPaymentsOneMonthPlusLate Number of payments the borrower made on Prosper loans that were greater than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.
  • ProsperPrincipalBorrowed Total principal borrowed on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans.
  • ProsperPrincipalOutstanding Principal outstanding on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans.
  • ScorexChangeAtTimeOfListing Borrower's credit score change at the time the credit profile was pulled. This will be the change relative to the borrower's last Prosper loan. This value will be null if the borrower had no prior loans.
  • LoanCurrentDaysDelinquent The number of days delinquent.
  • LoanFirstDefaultedCycleNumber The cycle the loan was charged off. If the loan has not charged off the value will be null.
  • LoanMonthsSinceOrigination Number of months since the loan originated.
  • LoanNumber Unique numeric value associated with the loan.
  • LoanOriginalAmount The origination amount of the loan.
  • LoanOriginationDate The date the loan was originated.
  • LoanOriginationQuarter The quarter in which the loan was originated.
  • MemberKey The unique key that is associated with the borrower. This is the same identifier that is used in the API member object.
  • MonthlyLoanPayment The scheduled monthly loan payment.
  • LP_CustomerPayments Pre charge-off cumulative gross payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries.
  • LP_CustomerPrincipalPayments Pre charge-off cumulative principal payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries.
  • LP_InterestandFees Pre charge-off cumulative interest and fees paid by the borrower. If the loan has charged off, this value will exclude any recoveries.
  • LP_ServiceFees Cumulative service fees paid by the investors who have invested in the loan.
  • LP_CollectionFees Cumulative collection fees paid by the investors who have invested in the loan.
  • LP_GrossPrincipalLoss The gross charged off amount of the loan.
  • LP_NetPrincipalLoss The principal that remains uncollected after any recoveries.
  • LP_NonPrincipalRecoverypayments The interest and fee component of any recovery payments. The current payment policy applies payments in the following order: Fees, interest, principal.
  • PercentFunded Percent the listing was funded.
  • Recommendations Number of recommendations the borrower had at the time the listing was created.
  • InvestmentFromFriendsCount Number of friends that made an investment in the loan.
  • InvestmentFromFriendsAmount Dollar amount of investments that were made by friends.
  • Investors The number of investors that funded the loan.

Preliminary Wrangling¶

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
sns.color_palette("colorblind")
%matplotlib inline
In [2]:
# loading the dataset...
df = pd.read_csv('prosperLoanData.csv')
#...and calling the head property to see the first five(5) values
df.head(5)
Out[2]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
0 1021339766868145413AB3B 193129 2007-08-26 19:09:29.263000000 C 36 Completed 2009-08-14 00:00:00 0.16516 0.1580 0.1380 ... -133.18 0.0 0.0 0.0 0.0 1.0 0 0 0.0 258
1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 0.0820 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
2 0EE9337825851032864889A 81716 2007-01-05 15:00:47.090000000 HR 36 Completed 2009-12-17 00:00:00 0.28269 0.2750 0.2400 ... -24.20 0.0 0.0 0.0 0.0 1.0 0 0 0.0 41
3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 0.0874 ... -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 0.1985 ... -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20

5 rows × 81 columns

In [3]:
#alright, let us see our data's column structure
df.columns
Out[3]:
Index(['ListingKey', 'ListingNumber', 'ListingCreationDate', 'CreditGrade',
       'Term', 'LoanStatus', 'ClosedDate', 'BorrowerAPR', 'BorrowerRate',
       'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss',
       'EstimatedReturn', 'ProsperRating (numeric)', 'ProsperRating (Alpha)',
       'ProsperScore', 'ListingCategory (numeric)', 'BorrowerState',
       'Occupation', 'EmploymentStatus', 'EmploymentStatusDuration',
       'IsBorrowerHomeowner', 'CurrentlyInGroup', 'GroupKey',
       'DateCreditPulled', 'CreditScoreRangeLower', 'CreditScoreRangeUpper',
       'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines',
       'TotalCreditLinespast7years', 'OpenRevolvingAccounts',
       'OpenRevolvingMonthlyPayment', 'InquiriesLast6Months', 'TotalInquiries',
       'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years',
       'PublicRecordsLast10Years', 'PublicRecordsLast12Months',
       'RevolvingCreditBalance', 'BankcardUtilization',
       'AvailableBankcardCredit', 'TotalTrades',
       'TradesNeverDelinquent (percentage)', 'TradesOpenedLast6Months',
       'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable',
       'StatedMonthlyIncome', 'LoanKey', 'TotalProsperLoans',
       'TotalProsperPaymentsBilled', 'OnTimeProsperPayments',
       'ProsperPaymentsLessThanOneMonthLate',
       'ProsperPaymentsOneMonthPlusLate', 'ProsperPrincipalBorrowed',
       'ProsperPrincipalOutstanding', 'ScorexChangeAtTimeOfListing',
       'LoanCurrentDaysDelinquent', 'LoanFirstDefaultedCycleNumber',
       'LoanMonthsSinceOrigination', 'LoanNumber', 'LoanOriginalAmount',
       'LoanOriginationDate', 'LoanOriginationQuarter', 'MemberKey',
       'MonthlyLoanPayment', 'LP_CustomerPayments',
       'LP_CustomerPrincipalPayments', 'LP_InterestandFees', 'LP_ServiceFees',
       'LP_CollectionFees', 'LP_GrossPrincipalLoss', 'LP_NetPrincipalLoss',
       'LP_NonPrincipalRecoverypayments', 'PercentFunded', 'Recommendations',
       'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount',
       'Investors'],
      dtype='object')
In [4]:
# Phew, that is a lot, let us explore our data further by checking
# the structure in other words, shape of our dataset
df.shape
Out[4]:
(113937, 81)

We have 113937 rows and 81 columns, that is a lot, but we will not be needing them all for our questions, before then, let us explore our data further to check for some errors or correction we might need to make.

In [5]:
#checking our dataset info, 
# data type and our column count to see if all entries are accurate
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   ListingKey                           113937 non-null  object 
 1   ListingNumber                        113937 non-null  int64  
 2   ListingCreationDate                  113937 non-null  object 
 3   CreditGrade                          28953 non-null   object 
 4   Term                                 113937 non-null  int64  
 5   LoanStatus                           113937 non-null  object 
 6   ClosedDate                           55089 non-null   object 
 7   BorrowerAPR                          113912 non-null  float64
 8   BorrowerRate                         113937 non-null  float64
 9   LenderYield                          113937 non-null  float64
 10  EstimatedEffectiveYield              84853 non-null   float64
 11  EstimatedLoss                        84853 non-null   float64
 12  EstimatedReturn                      84853 non-null   float64
 13  ProsperRating (numeric)              84853 non-null   float64
 14  ProsperRating (Alpha)                84853 non-null   object 
 15  ProsperScore                         84853 non-null   float64
 16  ListingCategory (numeric)            113937 non-null  int64  
 17  BorrowerState                        108422 non-null  object 
 18  Occupation                           110349 non-null  object 
 19  EmploymentStatus                     111682 non-null  object 
 20  EmploymentStatusDuration             106312 non-null  float64
 21  IsBorrowerHomeowner                  113937 non-null  bool   
 22  CurrentlyInGroup                     113937 non-null  bool   
 23  GroupKey                             13341 non-null   object 
 24  DateCreditPulled                     113937 non-null  object 
 25  CreditScoreRangeLower                113346 non-null  float64
 26  CreditScoreRangeUpper                113346 non-null  float64
 27  FirstRecordedCreditLine              113240 non-null  object 
 28  CurrentCreditLines                   106333 non-null  float64
 29  OpenCreditLines                      106333 non-null  float64
 30  TotalCreditLinespast7years           113240 non-null  float64
 31  OpenRevolvingAccounts                113937 non-null  int64  
 32  OpenRevolvingMonthlyPayment          113937 non-null  float64
 33  InquiriesLast6Months                 113240 non-null  float64
 34  TotalInquiries                       112778 non-null  float64
 35  CurrentDelinquencies                 113240 non-null  float64
 36  AmountDelinquent                     106315 non-null  float64
 37  DelinquenciesLast7Years              112947 non-null  float64
 38  PublicRecordsLast10Years             113240 non-null  float64
 39  PublicRecordsLast12Months            106333 non-null  float64
 40  RevolvingCreditBalance               106333 non-null  float64
 41  BankcardUtilization                  106333 non-null  float64
 42  AvailableBankcardCredit              106393 non-null  float64
 43  TotalTrades                          106393 non-null  float64
 44  TradesNeverDelinquent (percentage)   106393 non-null  float64
 45  TradesOpenedLast6Months              106393 non-null  float64
 46  DebtToIncomeRatio                    105383 non-null  float64
 47  IncomeRange                          113937 non-null  object 
 48  IncomeVerifiable                     113937 non-null  bool   
 49  StatedMonthlyIncome                  113937 non-null  float64
 50  LoanKey                              113937 non-null  object 
 51  TotalProsperLoans                    22085 non-null   float64
 52  TotalProsperPaymentsBilled           22085 non-null   float64
 53  OnTimeProsperPayments                22085 non-null   float64
 54  ProsperPaymentsLessThanOneMonthLate  22085 non-null   float64
 55  ProsperPaymentsOneMonthPlusLate      22085 non-null   float64
 56  ProsperPrincipalBorrowed             22085 non-null   float64
 57  ProsperPrincipalOutstanding          22085 non-null   float64
 58  ScorexChangeAtTimeOfListing          18928 non-null   float64
 59  LoanCurrentDaysDelinquent            113937 non-null  int64  
 60  LoanFirstDefaultedCycleNumber        16952 non-null   float64
 61  LoanMonthsSinceOrigination           113937 non-null  int64  
 62  LoanNumber                           113937 non-null  int64  
 63  LoanOriginalAmount                   113937 non-null  int64  
 64  LoanOriginationDate                  113937 non-null  object 
 65  LoanOriginationQuarter               113937 non-null  object 
 66  MemberKey                            113937 non-null  object 
 67  MonthlyLoanPayment                   113937 non-null  float64
 68  LP_CustomerPayments                  113937 non-null  float64
 69  LP_CustomerPrincipalPayments         113937 non-null  float64
 70  LP_InterestandFees                   113937 non-null  float64
 71  LP_ServiceFees                       113937 non-null  float64
 72  LP_CollectionFees                    113937 non-null  float64
 73  LP_GrossPrincipalLoss                113937 non-null  float64
 74  LP_NetPrincipalLoss                  113937 non-null  float64
 75  LP_NonPrincipalRecoverypayments      113937 non-null  float64
 76  PercentFunded                        113937 non-null  float64
 77  Recommendations                      113937 non-null  int64  
 78  InvestmentFromFriendsCount           113937 non-null  int64  
 79  InvestmentFromFriendsAmount          113937 non-null  float64
 80  Investors                            113937 non-null  int64  
dtypes: bool(3), float64(50), int64(11), object(17)
memory usage: 68.1+ MB
In [6]:
#Let us check for duplicated values
df[df.duplicated()]
Out[6]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors

0 rows × 81 columns

In [8]:
#No duplicates! Good now lets check for null values in one of our columns

df[df.LoanStatus.isnull()]
Out[8]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors

0 rows × 81 columns

We needed to check if all the values are entered for the column hence why we checked for the null values in the column.

Check for the loan term, one of our cogent columns, we need this column because, loan terms should have a category and we have to be sure if the data provided the loan term in category or not.¶

In [9]:
df.Term.value_counts()
Out[9]:
36    87778
60    24545
12     1614
Name: Term, dtype: int64
In [10]:
#it is not in category format but instead in numerical, we need to
#change this

df.Term.dtype
Out[10]:
dtype('int64')
In [11]:
#Convert to a categorical data type with each category named
df.Term = df.Term.map({36: 'Medium Term', 60:'Long Term', 12:'Short Term'})
In [12]:
df.Term.value_counts()
Out[12]:
Medium Term    87778
Long Term      24545
Short Term      1614
Name: Term, dtype: int64
In [13]:
#Good, now lets convert to a categorical data type, just to be sure
df.Term = df.Term.astype('category')
df.Term.dtype
Out[13]:
CategoricalDtype(categories=['Long Term', 'Medium Term', 'Short Term'], ordered=False)
We also discovered that another very important column, Loan Status needs to be in categorical format, we need this because, we have to know whether a loan is completed or not.¶
In [14]:
df.LoanStatus.value_counts()
Out[14]:
Current                   56576
Completed                 38074
Chargedoff                11992
Defaulted                  5018
Past Due (1-15 days)        806
Past Due (31-60 days)       363
Past Due (61-90 days)       313
Past Due (91-120 days)      304
Past Due (16-30 days)       265
FinalPaymentInProgress      205
Past Due (>120 days)         16
Cancelled                     5
Name: LoanStatus, dtype: int64
In [15]:
df.LoanStatus = df.LoanStatus.astype('category')
df.LoanStatus.dtype
Out[15]:
CategoricalDtype(categories=['Cancelled', 'Chargedoff', 'Completed', 'Current',
                  'Defaulted', 'FinalPaymentInProgress',
                  'Past Due (1-15 days)', 'Past Due (16-30 days)',
                  'Past Due (31-60 days)', 'Past Due (61-90 days)',
                  'Past Due (91-120 days)', 'Past Due (>120 days)'],
, ordered=False)
In [16]:
# Since all past due dates are due dates,let us convert the Past
# Due with days to a single past due statement
df.LoanStatus = df.LoanStatus.replace({'Past Due (1-15 days)':'Past Due', 'Past Due (16-30 days)':'Past Due',
                  'Past Due (31-60 days)':'Past Due', 'Past Due (61-90 days)' :'Past Due',
                  'Past Due (91-120 days)' :'Past Due', 'Past Due (>120 days)' :'Past Due'})
df.LoanStatus.value_counts()
Out[16]:
Current                   56576
Completed                 38074
Chargedoff                11992
Defaulted                  5018
Past Due                   2067
FinalPaymentInProgress      205
Cancelled                     5
Name: LoanStatus, dtype: int64
In [17]:
# Good, Let us proceed
# Let us check Home Owners Column

df.IsBorrowerHomeowner.value_counts()
Out[17]:
True     57478
False    56459
Name: IsBorrowerHomeowner, dtype: int64
In [18]:
# date loan was originated
df.LoanOriginationDate.value_counts()
Out[18]:
2014-01-22 00:00:00    491
2013-11-13 00:00:00    490
2014-02-19 00:00:00    439
2013-10-16 00:00:00    434
2014-01-28 00:00:00    339
                      ... 
2006-02-03 00:00:00      1
2006-01-24 00:00:00      1
2005-11-18 00:00:00      1
2009-07-20 00:00:00      1
2005-11-15 00:00:00      1
Name: LoanOriginationDate, Length: 1873, dtype: int64
In [20]:
# loan origination date to be converted to Year, to assist our analysis better
df.LoanOriginationDate = pd.to_datetime(df.LoanOriginationDate)
df.LoanOriginationDate.dtype
Out[20]:
dtype('<M8[ns]')
In [21]:
#Convert to Year instead, we are going to perform our analysis by year
df['Year'] = pd.DatetimeIndex(df['LoanOriginationDate']).year
In [22]:
df.Year
Out[22]:
0         2007
1         2014
2         2007
3         2012
4         2013
          ... 
113932    2013
113933    2011
113934    2013
113935    2011
113936    2014
Name: Year, Length: 113937, dtype: int64
In [23]:
# check data type
df.Year.dtype
Out[23]:
dtype('int64')
In [25]:
# convert year to categorical variable
df.Year = df.Year.astype('category')
df.Year.dtype
Out[25]:
CategoricalDtype(categories=[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014], ordered=False)

Let us check for other data, just to make sure we are on the right track

In [26]:
# value count of verfiable income
df.IncomeVerifiable.value_counts() 
Out[26]:
True     105268
False      8669
Name: IncomeVerifiable, dtype: int64
In [27]:
# number of recommedations values
df.Recommendations.value_counts()
Out[27]:
0     109678
1       3516
2        568
3        108
4         26
5         14
9          6
7          5
6          4
8          3
18         2
16         2
14         1
21         1
24         1
19         1
39         1
Name: Recommendations, dtype: int64
In [28]:
# emolyment status values
df.EmploymentStatus.value_counts()
Out[28]:
Employed         67322
Full-time        26355
Self-employed     6134
Not available     5347
Other             3806
Part-time         1088
Not employed       835
Retired            795
Name: EmploymentStatus, dtype: int64
In [29]:
# sum of missing values
df.EmploymentStatus.isnull().sum() 
Out[29]:
2255
In [31]:
# EmploymentStatus missing values records
df[df.EmploymentStatus.isnull()]
Out[31]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors Year
39 109D3366077649289619396 24135 2006-07-11 17:14:14.647000000 B Medium Term Completed 2009-10-22 00:00:00 0.15211 0.1450 0.1400 ... -299.78 0.00 0.00 0.0 1.0 0 0 0.0 44 2006
92 0FE433661215082408720F0 22591 2006-07-02 11:05:02.857000000 AA Medium Term Completed 2007-03-01 00:00:00 0.08292 0.0755 0.0705 ... 0.00 0.00 0.00 0.0 1.0 0 0 0.0 21 2006
146 34B03365511229087456DB5 31745 2006-08-16 09:43:10.867000000 A Medium Term Completed 2006-12-04 00:00:00 0.09939 0.0925 0.0850 ... 0.00 0.00 0.00 0.0 1.0 0 0 0.0 67 2006
162 10463364871408702EBC041 17578 2006-06-03 07:46:36.343000000 AA Medium Term Completed 2007-01-09 00:00:00 0.08483 0.0780 0.0730 ... 0.00 0.00 0.00 0.0 1.0 0 0 0.0 93 2006
193 0F6133644711535583E92C3 17091 2006-05-31 23:48:17.177000000 E Medium Term Defaulted 2007-10-08 00:00:00 0.23937 0.2300 0.2250 ... 0.00 1590.40 1584.56 0.0 1.0 0 0 0.0 13 2006
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113663 DBD7336509210597772988B 21182 2006-06-22 12:04:17.483000000 HR Medium Term Chargedoff 2008-12-06 00:00:00 0.29525 0.2875 0.2450 ... 0.00 1081.21 1081.22 0.0 1.0 0 0 0.0 22 2006
113712 DC1033648016334634330F6 18121 2006-06-06 15:08:32.807000000 HR Medium Term Defaulted 2007-05-16 00:00:00 0.24502 0.2375 0.2325 ... -93.79 5315.12 5340.58 0.0 1.0 0 0 0.0 51 2006
113902 E6D13366465509973F5E90A 3579 2006-03-26 00:11:04.620000000 A Medium Term Completed 2006-05-24 00:00:00 0.08805 0.0812 0.0750 ... 0.00 0.00 0.00 0.0 1.0 0 0 0.0 143 2006
113914 D20533652658825353F6597 30223 2006-08-09 14:34:40.010000000 HR Medium Term Defaulted 2007-03-19 00:00:00 0.25757 0.2500 0.2450 ... 0.00 2883.01 2883.00 0.0 1.0 0 0 0.0 1 2006
113921 D21A33647075665665BE266 4 2005-11-09 20:44:28.847000000 AA Medium Term Completed 2005-11-25 00:00:00 NaN 0.0400 0.0350 ... 0.00 0.00 0.00 0.0 1.0 0 0 0.0 1 2005

2255 rows × 82 columns

In [32]:
# Percentage (%) of missing values
df.EmploymentStatus.isnull().mean() * 100
Out[32]:
1.9791639239228696

We can see that, the column data type in Employment Status should be categorical with a 2255 sum of missing values representing 1.98% of missing values in the colum, replacing the missing value with the most occuring employment status(because we need accuracy in our data) and changing the data type to categorical data type will suffice.

In [33]:
# replace missing value with most occured value
df.EmploymentStatus.fillna('Employed', inplace = True) 
#and change data type to categorical
df.EmploymentStatus = df.EmploymentStatus.astype('category')
# check data type
df.EmploymentStatus.dtype
Out[33]:
CategoricalDtype(categories=['Employed', 'Full-time', 'Not available', 'Not employed',
                  'Other', 'Part-time', 'Retired', 'Self-employed'],
, ordered=False)
In [34]:
#Good, let us continue our data exploration
# stated monthly income 
df.StatedMonthlyIncome
Out[34]:
0         3083.333333
1         6125.000000
2         2083.333333
3         2875.000000
4         9583.333333
             ...     
113932    4333.333333
113933    8041.666667
113934    2875.000000
113935    3875.000000
113936    4583.333333
Name: StatedMonthlyIncome, Length: 113937, dtype: float64
In [35]:
# number of days of loan delinquency values
df.LoanCurrentDaysDelinquent.value_counts() 
Out[35]:
0       94860
121       293
10        192
4          77
11         62
        ...  
582         1
1099        1
2555        1
787         1
1081        1
Name: LoanCurrentDaysDelinquent, Length: 2411, dtype: int64
In [36]:
# sorted from highest days of delinquency to lowest
df.LoanCurrentDaysDelinquent.sort_values(ascending = False, ignore_index= True)
Out[36]:
0         2704
1         2703
2         2613
3         2612
4         2599
          ... 
113932       0
113933       0
113934       0
113935       0
113936       0
Name: LoanCurrentDaysDelinquent, Length: 113937, dtype: int64
In [37]:
# stated monthly income 
df.StatedMonthlyIncome
Out[37]:
0         3083.333333
1         6125.000000
2         2083.333333
3         2875.000000
4         9583.333333
             ...     
113932    4333.333333
113933    8041.666667
113934    2875.000000
113935    3875.000000
113936    4583.333333
Name: StatedMonthlyIncome, Length: 113937, dtype: float64

Let us round up the Stated monthly income so we'll have a round figure.

In [38]:
# round up to two decimal
df.StatedMonthlyIncome = df.StatedMonthlyIncome.astype(int)
In [39]:
df.StatedMonthlyIncome
Out[39]:
0         3083
1         6125
2         2083
3         2875
4         9583
          ... 
113932    4333
113933    8041
113934    2875
113935    3875
113936    4583
Name: StatedMonthlyIncome, Length: 113937, dtype: int32
In [40]:
#Let us sort from highest to lowest
df.StatedMonthlyIncome.sort_values(ascending = False, ignore_index= True)
Out[40]:
0         1750002
1          618547
2          483333
3          466666
4          416666
           ...   
113932          0
113933          0
113934          0
113935          0
113936          0
Name: StatedMonthlyIncome, Length: 113937, dtype: int32
In [41]:
#Let us sort from highest to lowest our LoanOriginalAmount
df.LoanOriginalAmount.sort_values(ascending = False, ignore_index= True)
Out[41]:
0         35000
1         35000
2         35000
3         35000
4         35000
          ...  
113932     1000
113933     1000
113934     1000
113935     1000
113936     1000
Name: LoanOriginalAmount, Length: 113937, dtype: int64

FURTHER EXPLORATIONS

In [42]:
# monthly loan payment data type
df.MonthlyLoanPayment.dtype
Out[42]:
dtype('float64')
In [44]:
# convert from float to integer
df.MonthlyLoanPayment = df.MonthlyLoanPayment.astype(int)
In [45]:
df.MonthlyLoanPayment.dtype
Out[45]:
dtype('int32')
In [46]:
# sort from highest monthly loan payment to lowest
df.MonthlyLoanPayment.sort_values(ascending = False, ignore_index= True)
Out[46]:
0         2251
1         2218
2         2179
3         2163
4         2153
          ... 
113932       0
113933       0
113934       0
113935       0
113936       0
Name: MonthlyLoanPayment, Length: 113937, dtype: int32
In [47]:
# sum of missing values in occupation column
df.Occupation.isnull().sum()
Out[47]:
3588
In [48]:
# let us see what the records are
df[df.Occupation.isnull()]
Out[48]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors Year
34 0F6C3603015887476F3F015 1180690 2014-02-24 12:55:20.430000000 NaN Medium Term Current NaN 0.22966 0.1920 0.1820 ... 0.00 0.00 0.00 0.0 1.0 0 0 0.0 1 2014
39 109D3366077649289619396 24135 2006-07-11 17:14:14.647000000 B Medium Term Completed 2009-10-22 00:00:00 0.15211 0.1450 0.1400 ... -299.78 0.00 0.00 0.0 1.0 0 0 0.0 44 2006
92 0FE433661215082408720F0 22591 2006-07-02 11:05:02.857000000 AA Medium Term Completed 2007-03-01 00:00:00 0.08292 0.0755 0.0705 ... 0.00 0.00 0.00 0.0 1.0 0 0 0.0 21 2006
146 34B03365511229087456DB5 31745 2006-08-16 09:43:10.867000000 A Medium Term Completed 2006-12-04 00:00:00 0.09939 0.0925 0.0850 ... 0.00 0.00 0.00 0.0 1.0 0 0 0.0 67 2006
161 10453596739685417502B89 1086023 2013-12-20 14:57:44.580000000 NaN Medium Term Current NaN 0.17151 0.1355 0.1255 ... 0.00 0.00 0.00 0.0 1.0 0 0 0.0 2 2013
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113709 DC043594246199576475BB4 1017315 2013-11-05 05:44:05.173000000 NaN Medium Term Current NaN 0.33215 0.2925 0.2825 ... 0.00 0.00 0.00 0.0 1.0 0 0 0.0 1 2013
113712 DC1033648016334634330F6 18121 2006-06-06 15:08:32.807000000 HR Medium Term Defaulted 2007-05-16 00:00:00 0.24502 0.2375 0.2325 ... -93.79 5315.12 5340.58 0.0 1.0 0 0 0.0 51 2006
113902 E6D13366465509973F5E90A 3579 2006-03-26 00:11:04.620000000 A Medium Term Completed 2006-05-24 00:00:00 0.08805 0.0812 0.0750 ... 0.00 0.00 0.00 0.0 1.0 0 0 0.0 143 2006
113914 D20533652658825353F6597 30223 2006-08-09 14:34:40.010000000 HR Medium Term Defaulted 2007-03-19 00:00:00 0.25757 0.2500 0.2450 ... 0.00 2883.01 2883.00 0.0 1.0 0 0 0.0 1 2006
113921 D21A33647075665665BE266 4 2005-11-09 20:44:28.847000000 AA Medium Term Completed 2005-11-25 00:00:00 NaN 0.0400 0.0350 ... 0.00 0.00 0.00 0.0 1.0 0 0 0.0 1 2005

3588 rows × 82 columns

In [49]:
# Percentage (%) of missing values in occupation 
df.Occupation.isnull().mean()* 100
Out[49]:
3.149108717975723

In our occupation value column, we have 3588 values that are null, amounting to 3.149% of key values missing. hence, we have to fill in the null values with a keyword that is identifiable;"Unknown"

We also need to sort out our Occupation Value column in a categorical way to ensure that we have a categorical data

In [50]:
# value count of occupation
df.Occupation.value_counts()
Out[50]:
Other                          28617
Professional                   13628
Computer Programmer             4478
Executive                       4311
Teacher                         3759
                               ...  
Dentist                           68
Student - College Freshman        41
Student - Community College       28
Judge                             22
Student - Technical School        16
Name: Occupation, Length: 67, dtype: int64
In [51]:
# list of all occuption recorded in our dataset
list(df.Occupation)
Out[51]:
['Other',
 'Professional',
 'Other',
 'Skilled Labor',
 'Executive',
 'Professional',
 'Sales - Retail',
 'Laborer',
 'Food Service',
 'Food Service',
 'Fireman',
 'Waiter/Waitress',
 'Sales - Retail',
 'Construction',
 'Computer Programmer',
 'Other',
 'Professional',
 'Professional',
 'Sales - Commission',
 'Laborer',
 'Retail Management',
 'Professional',
 'Other',
 'Skilled Labor',
 'Other',
 'Engineer - Mechanical',
 'Sales - Commission',
 'Executive',
 'Military Enlisted',
 'Other',
 'Other',
 'Other',
 'Clerical',
 'Other',
 nan,
 'Retail Management',
 'Professional',
 'Teacher',
 'Other',
 nan,
 'Other',
 'Other',
 'Other',
 'Clergy',
 'Professional',
 'Executive',
 'Accountant/CPA',
 'Professional',
 'Attorney',
 'Professional',
 'Nurse (RN)',
 'Nurse (RN)',
 'Other',
 'Accountant/CPA',
 'Executive',
 'Retail Management',
 'Retail Management',
 'Construction',
 'Clergy',
 'Other',
 'Teacher',
 'Analyst',
 'Other',
 'Analyst',
 'Sales - Commission',
 'Other',
 'Other',
 'Professional',
 "Nurse's Aide",
 'Investor',
 'Retail Management',
 'Other',
 'Other',
 'Other',
 'Fireman',
 'Skilled Labor',
 'Executive',
 'Analyst',
 'Realtor',
 'Skilled Labor',
 'Other',
 'Professional',
 'Other',
 'Accountant/CPA',
 'Flight Attendant',
 'Other',
 'Nurse (LPN)',
 'Professional',
 'Construction',
 'Construction',
 'Clerical',
 'Other',
 nan,
 'Laborer',
 'Military Officer',
 'Analyst',
 'Clerical',
 'Sales - Retail',
 'Computer Programmer',
 'Other',
 'Other',
 'Food Service Management',
 'Professional',
 'Truck Driver',
 'Other',
 'Sales - Commission',
 'Other',
 'Administrative Assistant',
 'Executive',
 'Police Officer/Correction Officer',
 'Nurse (RN)',
 'Laborer',
 'Social Worker',
 'Other',
 'Computer Programmer',
 'Professional',
 'Executive',
 'Professional',
 'Teacher',
 'Clerical',
 'Laborer',
 'Military Officer',
 'Other',
 'Sales - Retail',
 'Construction',
 'Other',
 'Sales - Commission',
 'Administrative Assistant',
 'Other',
 'Retail Management',
 'Food Service Management',
 'Other',
 'Military Enlisted',
 'Other',
 'Professional',
 'Skilled Labor',
 'Sales - Commission',
 'Construction',
 'Military Enlisted',
 'Other',
 'Accountant/CPA',
 'Truck Driver',
 'Analyst',
 'Fireman',
 'Sales - Commission',
 'Tradesman - Mechanic',
 nan,
 'Professional',
 'Accountant/CPA',
 'Professional',
 'Professional',
 'Executive',
 'Skilled Labor',
 'Skilled Labor',
 'Teacher',
 'Computer Programmer',
 'Professional',
 'Food Service',
 'Tradesman - Mechanic',
 'Analyst',
 'Military Officer',
 nan,
 nan,
 'Other',
 'Other',
 'Other',
 'Executive',
 'Medical Technician',
 'Executive',
 'Professor',
 'Other',
 'Tradesman - Mechanic',
 'Other',
 'Skilled Labor',
 'Teacher',
 'Postal Service',
 'Sales - Commission',
 'Social Worker',
 'Other',
 'Postal Service',
 'Other',
 'Sales - Retail',
 'Computer Programmer',
 'Attorney',
 'Military Enlisted',
 'Teacher',
 'Other',
 'Other',
 'Other',
 'Sales - Commission',
 'Professional',
 'Analyst',
 'Military Enlisted',
 nan,
 'Other',
 'Nurse (RN)',
 'Other',
 'Waiter/Waitress',
 'Professional',
 'Civil Service',
 'Executive',
 'Other',
 'Medical Technician',
 'Other',
 'Teacher',
 'Other',
 nan,
 'Retail Management',
 'Other',
 'Accountant/CPA',
 'Analyst',
 'Other',
 'Professional',
 'Sales - Commission',
 'Professional',
 'Realtor',
 'Investor',
 'Food Service',
 'Other',
 'Professional',
 'Police Officer/Correction Officer',
 'Other',
 'Pharmacist',
 'Administrative Assistant',
 'Sales - Commission',
 'Nurse (RN)',
 'Other',
 'Professional',
 'Truck Driver',
 nan,
 'Other',
 'Skilled Labor',
 'Other',
 'Other',
 'Nurse (RN)',
 'Truck Driver',
 'Tradesman - Mechanic',
 nan,
 'Other',
 'Professional',
 'Laborer',
 'Administrative Assistant',
 'Professional',
 'Construction',
 'Truck Driver',
 'Computer Programmer',
 'Pharmacist',
 nan,
 'Administrative Assistant',
 'Teacher',
 nan,
 'Other',
 'Clerical',
 'Other',
 'Other',
 'Computer Programmer',
 'Computer Programmer',
 'Clerical',
 'Administrative Assistant',
 'Sales - Retail',
 'Professional',
 'Other',
 'Police Officer/Correction Officer',
 'Professional',
 'Food Service Management',
 'Retail Management',
 'Other',
 'Other',
 'Other',
 'Other',
 'Other',
 'Computer Programmer',
 'Tradesman - Electrician',
 'Other',
 'Computer Programmer',
 'Professional',
 'Other',
 'Other',
 'Truck Driver',
 'Scientist',
 'Engineer - Mechanical',
 'Skilled Labor',
 'Other',
 'Teacher',
 'Accountant/CPA',
 'Retail Management',
 'Accountant/CPA',
 'Dentist',
 'Engineer - Electrical',
 'Other',
 'Skilled Labor',
 'Sales - Commission',
 'Professional',
 'Clerical',
 'Retail Management',
 'Other',
 'Administrative Assistant',
 'Other',
 'Retail Management',
 'Teacher',
 'Computer Programmer',
 'Civil Service',
 'Sales - Retail',
 'Architect',
 'Professional',
 'Other',
 'Other',
 'Professional',
 'Landscaping',
 'Analyst',
 nan,
 'Nurse (RN)',
 'Administrative Assistant',
 'Professional',
 'Executive',
 'Landscaping',
 'Administrative Assistant',
 'Professional',
 'Clerical',
 'Executive',
 'Construction',
 'Professional',
 'Analyst',
 'Executive',
 'Laborer',
 nan,
 'Sales - Retail',
 'Professional',
 'Nurse (LPN)',
 'Waiter/Waitress',
 'Executive',
 'Other',
 nan,
 'Flight Attendant',
 'Executive',
 'Other',
 'Professional',
 'Retail Management',
 'Tradesman - Carpenter',
 'Other',
 'Waiter/Waitress',
 'Administrative Assistant',
 'Other',
 'Tradesman - Electrician',
 'Medical Technician',
 'Other',
 'Retail Management',
 'Truck Driver',
 'Analyst',
 nan,
 'Skilled Labor',
 'Sales - Commission',
 'Executive',
 'Professional',
 'Truck Driver',
 'Police Officer/Correction Officer',
 nan,
 'Civil Service',
 'Professional',
 'Other',
 'Professional',
 'Clerical',
 'Accountant/CPA',
 'Executive',
 'Bus Driver',
 'Professional',
 'Sales - Commission',
 'Sales - Retail',
 'Sales - Retail',
 'Food Service',
 'Skilled Labor',
 'Computer Programmer',
 'Postal Service',
 'Other',
 'Executive',
 'Other',
 'Nurse (LPN)',
 'Analyst',
 'Medical Technician',
 'Other',
 'Professional',
 'Other',
 'Engineer - Mechanical',
 'Teacher',
 'Tradesman - Plumber',
 'Retail Management',
 'Professional',
 nan,
 'Construction',
 'Clerical',
 'Professional',
 'Other',
 'Other',
 'Professional',
 'Computer Programmer',
 'Other',
 'Tradesman - Electrician',
 'Professional',
 'Professional',
 'Administrative Assistant',
 'Executive',
 'Computer Programmer',
 'Other',
 'Other',
 'Bus Driver',
 'Teacher',
 'Architect',
 'Computer Programmer',
 'Executive',
 'Other',
 'Other',
 'Engineer - Chemical',
 'Computer Programmer',
 'Other',
 'Computer Programmer',
 'Professional',
 'Food Service',
 'Other',
 'Doctor',
 'Other',
 'Construction',
 'Other',
 'Analyst',
 'Executive',
 'Other',
 'Administrative Assistant',
 'Chemist',
 'Student - College Senior',
 'Other',
 'Principal',
 'Clerical',
 'Laborer',
 'Other',
 'Other',
 'Civil Service',
 'Skilled Labor',
 'Executive',
 'Professional',
 'Attorney',
 'Professional',
 'Professional',
 'Other',
 'Other',
 'Sales - Commission',
 'Other',
 'Accountant/CPA',
 'Social Worker',
 'Sales - Retail',
 nan,
 'Truck Driver',
 'Skilled Labor',
 'Administrative Assistant',
 'Scientist',
 'Medical Technician',
 'Computer Programmer',
 'Sales - Commission',
 'Computer Programmer',
 'Military Enlisted',
 'Engineer - Mechanical',
 'Other',
 'Doctor',
 'Civil Service',
 'Skilled Labor',
 'Doctor',
 'Sales - Retail',
 "Teacher's Aide",
 'Police Officer/Correction Officer',
 'Skilled Labor',
 'Other',
 'Professional',
 'Professional',
 'Executive',
 nan,
 'Landscaping',
 'Other',
 'Other',
 'Other',
 'Other',
 'Teacher',
 'Other',
 'Waiter/Waitress',
 'Other',
 'Administrative Assistant',
 'Other',
 'Professional',
 'Military Enlisted',
 'Nurse (RN)',
 'Analyst',
 'Skilled Labor',
 'Other',
 'Other',
 'Analyst',
 'Pilot - Private/Commercial',
 'Other',
 'Other',
 'Other',
 'Executive',
 'Other',
 'Professional',
 'Engineer - Mechanical',
 'Other',
 'Sales - Commission',
 'Engineer - Electrical',
 'Skilled Labor',
 'Computer Programmer',
 'Other',
 'Other',
 'Nurse (RN)',
 "Nurse's Aide",
 'Other',
 'Retail Management',
 'Other',
 'Pharmacist',
 'Scientist',
 'Professional',
 'Analyst',
 'Military Enlisted',
 "Teacher's Aide",
 'Other',
 'Other',
 'Sales - Retail',
 'Doctor',
 'Teacher',
 'Professional',
 'Skilled Labor',
 'Other',
 'Accountant/CPA',
 'Professional',
 'Food Service',
 'Food Service',
 'Sales - Retail',
 'Executive',
 nan,
 'Engineer - Electrical',
 'Professional',
 'Other',
 'Other',
 'Professional',
 'Other',
 'Analyst',
 'Truck Driver',
 'Other',
 'Other',
 'Truck Driver',
 'Sales - Retail',
 'Other',
 'Analyst',
 'Professional',
 'Religious',
 'Computer Programmer',
 'Other',
 'Truck Driver',
 'Other',
 'Administrative Assistant',
 'Other',
 'Analyst',
 nan,
 'Other',
 'Other',
 'Retail Management',
 'Food Service',
 'Other',
 'Other',
 'Executive',
 'Other',
 'Homemaker',
 'Engineer - Electrical',
 'Engineer - Mechanical',
 'Executive',
 'Police Officer/Correction Officer',
 'Food Service Management',
 'Retail Management',
 'Professional',
 'Executive',
 nan,
 'Truck Driver',
 'Homemaker',
 'Professional',
 'Professional',
 'Other',
 'Food Service',
 nan,
 'Retail Management',
 'Nurse (RN)',
 'Analyst',
 nan,
 'Accountant/CPA',
 'Other',
 'Clerical',
 'Professional',
 nan,
 'Analyst',
 'Professional',
 'Nurse (RN)',
 'Analyst',
 'Sales - Commission',
 'Nurse (RN)',
 'Retail Management',
 'Computer Programmer',
 'Scientist',
 'Executive',
 'Professional',
 'Sales - Commission',
 'Other',
 'Professional',
 'Other',
 'Analyst',
 'Accountant/CPA',
 'Attorney',
 'Analyst',
 'Other',
 'Executive',
 'Administrative Assistant',
 'Other',
 'Retail Management',
 'Truck Driver',
 'Nurse (RN)',
 'Truck Driver',
 'Other',
 'Tradesman - Mechanic',
 'Sales - Retail',
 'Executive',
 'Tradesman - Electrician',
 'Other',
 'Clergy',
 'Accountant/CPA',
 'Other',
 'Other',
 'Other',
 'Laborer',
 'Other',
 'Other',
 'Other',
 'Analyst',
 'Other',
 'Realtor',
 'Teacher',
 'Doctor',
 'Engineer - Electrical',
 'Social Worker',
 'Professional',
 'Teacher',
 'Engineer - Mechanical',
 'Executive',
 'Sales - Retail',
 'Professional',
 'Executive',
 'Professional',
 'Nurse (RN)',
 nan,
 'Computer Programmer',
 'Professional',
 'Fireman',
 'Other',
 nan,
 'Student - College Graduate Student',
 'Other',
 'Religious',
 'Analyst',
 'Sales - Retail',
 'Military Enlisted',
 'Professional',
 'Professional',
 'Accountant/CPA',
 'Civil Service',
 'Investor',
 'Other',
 'Realtor',
 'Professional',
 'Professional',
 'Accountant/CPA',
 'Truck Driver',
 'Truck Driver',
 'Other',
 'Military Enlisted',
 'Skilled Labor',
 'Professor',
 'Accountant/CPA',
 'Attorney',
 'Other',
 'Other',
 'Attorney',
 'Other',
 'Tradesman - Electrician',
 'Truck Driver',
 'Skilled Labor',
 'Other',
 'Pilot - Private/Commercial',
 'Other',
 'Other',
 'Other',
 'Professional',
 'Computer Programmer',
 'Medical Technician',
 'Professional',
 'Professional',
 'Other',
 'Computer Programmer',
 nan,
 'Other',
 'Realtor',
 'Professional',
 'Analyst',
 'Administrative Assistant',
 "Nurse's Aide",
 'Other',
 'Police Officer/Correction Officer',
 'Executive',
 'Laborer',
 'Professor',
 'Professional',
 'Professional',
 'Professional',
 'Other',
 'Teacher',
 'Tradesman - Electrician',
 'Other',
 'Tradesman - Mechanic',
 'Police Officer/Correction Officer',
 'Analyst',
 'Other',
 'Professional',
 'Tradesman - Mechanic',
 'Other',
 'Computer Programmer',
 'Food Service Management',
 'Police Officer/Correction Officer',
 'Accountant/CPA',
 'Professional',
 'Other',
 'Other',
 'Analyst',
 'Other',
 'Other',
 'Accountant/CPA',
 'Architect',
 'Other',
 'Other',
 'Other',
 'Other',
 'Teacher',
 'Other',
 "Nurse's Aide",
 'Computer Programmer',
 "Nurse's Aide",
 'Analyst',
 'Police Officer/Correction Officer',
 'Tradesman - Plumber',
 'Professional',
 'Other',
 'Professor',
 'Professional',
 'Truck Driver',
 'Sales - Commission',
 'Professional',
 'Other',
 'Other',
 'Doctor',
 'Sales - Commission',
 'Administrative Assistant',
 'Professional',
 'Professional',
 'Student - College Senior',
 'Teacher',
 'Retail Management',
 'Computer Programmer',
 'Executive',
 'Other',
 'Police Officer/Correction Officer',
 'Executive',
 'Administrative Assistant',
 'Nurse (RN)',
 'Other',
 'Other',
 'Other',
 'Professional',
 'Engineer - Electrical',
 'Other',
 'Sales - Retail',
 'Nurse (RN)',
 'Teacher',
 'Clerical',
 'Sales - Retail',
 'Clerical',
 'Tradesman - Mechanic',
 'Civil Service',
 'Other',
 'Other',
 'Other',
 'Engineer - Mechanical',
 'Teacher',
 'Teacher',
 'Professional',
 'Tradesman - Mechanic',
 'Other',
 'Nurse (RN)',
 'Other',
 'Retail Management',
 'Police Officer/Correction Officer',
 'Accountant/CPA',
 'Other',
 'Sales - Retail',
 'Student - Technical School',
 'Other',
 'Other',
 'Food Service Management',
 'Professional',
 "Nurse's Aide",
 'Teacher',
 'Professional',
 'Retail Management',
 'Other',
 'Other',
 'Other',
 'Other',
 'Professional',
 'Sales - Commission',
 'Executive',
 'Other',
 'Truck Driver',
 'Other',
 'Pilot - Private/Commercial',
 'Administrative Assistant',
 'Sales - Retail',
 'Executive',
 'Other',
 'Other',
 'Other',
 'Computer Programmer',
 'Other',
 'Executive',
 'Other',
 nan,
 'Other',
 'Other',
 'Nurse (LPN)',
 nan,
 'Accountant/CPA',
 'Police Officer/Correction Officer',
 'Professional',
 'Engineer - Mechanical',
 'Investor',
 nan,
 'Skilled Labor',
 'Military Officer',
 'Sales - Commission',
 'Social Worker',
 'Executive',
 'Professional',
 'Student - College Senior',
 'Professional',
 'Laborer',
 'Professional',
 nan,
 'Professional',
 'Retail Management',
 'Other',
 'Skilled Labor',
 'Teacher',
 'Executive',
 'Administrative Assistant',
 'Engineer - Mechanical',
 'Other',
 'Other',
 'Other',
 'Computer Programmer',
 'Nurse (RN)',
 'Administrative Assistant',
 'Other',
 'Executive',
 'Construction',
 'Principal',
 'Administrative Assistant',
 'Tradesman - Mechanic',
 'Other',
 'Professional',
 'Police Officer/Correction Officer',
 'Analyst',
 'Other',
 'Teacher',
 'Fireman',
 'Other',
 'Professional',
 'Executive',
 'Teacher',
 'Skilled Labor',
 'Construction',
 nan,
 'Police Officer/Correction Officer',
 'Postal Service',
 'Other',
 'Professional',
 'Other',
 'Nurse (LPN)',
 'Construction',
 'Food Service',
 'Police Officer/Correction Officer',
 'Other',
 'Professional',
 'Accountant/CPA',
 'Executive',
 'Professional',
 'Civil Service',
 'Construction',
 'Analyst',
 nan,
 'Teacher',
 nan,
 'Other',
 'Other',
 'Sales - Commission',
 'Other',
 'Clerical',
 'Computer Programmer',
 'Truck Driver',
 'Medical Technician',
 'Nurse (RN)',
 'Other',
 'Other',
 'Medical Technician',
 'Professional',
 'Other',
 'Other',
 'Accountant/CPA',
 'Laborer',
 'Other',
 'Sales - Commission',
 'Other',
 'Other',
 'Teacher',
 'Nurse (RN)',
 'Retail Management',
 'Administrative Assistant',
 'Investor',
 'Professional',
 'Other',
 'Other',
 'Computer Programmer',
 'Laborer',
 'Laborer',
 'Teacher',
 'Clerical',
 'Psychologist',
 'Professional',
 'Other',
 'Retail Management',
 'Professional',
 'Truck Driver',
 'Homemaker',
 'Retail Management',
 'Food Service Management',
 'Other',
 'Professional',
 'Sales - Retail',
 'Principal',
 'Other',
 'Sales - Retail',
 'Computer Programmer',
 'Realtor',
 'Administrative Assistant',
 'Other',
 'Sales - Retail',
 'Sales - Commission',
 'Retail Management',
 'Medical Technician',
 'Accountant/CPA',
 'Computer Programmer',
 'Teacher',
 'Tradesman - Mechanic',
 'Social Worker',
 'Retail Management',
 'Engineer - Mechanical',
 'Engineer - Mechanical',
 'Other',
 'Retail Management',
 nan,
 'Police Officer/Correction Officer',
 'Professional',
 'Other',
 'Skilled Labor',
 'Other',
 'Skilled Labor',
 'Computer Programmer',
 'Fireman',
 'Clerical',
 'Professional',
 'Computer Programmer',
 'Civil Service',
 'Other',
 'Other',
 'Professional',
 'Computer Programmer',
 'Biologist',
 'Truck Driver',
 'Executive',
 'Medical Technician',
 'Accountant/CPA',
 'Postal Service',
 'Other',
 'Other',
 ...]
In [52]:
#A lot!, but no problem, data type?
df.Occupation.dtype
Out[52]:
dtype('O')

I ran into an error here, I changed the data type first before attempting to fill in the null values and it refused to work, meaning, I will have to covert back to object and fill in the na values with "Unknown" before switching back to coverting the data type.

In [58]:
# convert to category data type
df.Occupation = df.Occupation.astype('object')
df.Occupation.dtype
Out[58]:
dtype('O')
In [59]:
# replace null with unknown
df.Occupation.fillna('Unknown', inplace = True)
In [60]:
# check for null records
df[df.Occupation.isnull()]
Out[60]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors Year

0 rows × 82 columns

In [62]:
# Good, let us now convert to categorical data type
df.Occupation = df.Occupation.astype('category')
df.Occupation.dtype
Out[62]:
CategoricalDtype(categories=['Accountant/CPA', 'Administrative Assistant', 'Analyst',
                  'Architect', 'Attorney', 'Biologist', 'Bus Driver',
                  'Car Dealer', 'Chemist', 'Civil Service', 'Clergy',
                  'Clerical', 'Computer Programmer', 'Construction', 'Dentist',
                  'Doctor', 'Engineer - Chemical', 'Engineer - Electrical',
                  'Engineer - Mechanical', 'Executive', 'Fireman',
                  'Flight Attendant', 'Food Service',
                  'Food Service Management', 'Homemaker', 'Investor', 'Judge',
                  'Laborer', 'Landscaping', 'Medical Technician',
                  'Military Enlisted', 'Military Officer', 'Nurse (LPN)',
                  'Nurse (RN)', 'Nurse's Aide', 'Other', 'Pharmacist',
                  'Pilot - Private/Commercial',
                  'Police Officer/Correction Officer', 'Postal Service',
                  'Principal', 'Professional', 'Professor', 'Psychologist',
                  'Realtor', 'Religious', 'Retail Management',
                  'Sales - Commission', 'Sales - Retail', 'Scientist',
                  'Skilled Labor', 'Social Worker',
                  'Student - College Freshman',
                  'Student - College Graduate Student',
                  'Student - College Junior', 'Student - College Senior',
                  'Student - College Sophomore', 'Student - Community College',
                  'Student - Technical School', 'Teacher', 'Teacher's Aide',
                  'Tradesman - Carpenter', 'Tradesman - Electrician',
                  'Tradesman - Mechanic', 'Tradesman - Plumber',
                  'Truck Driver', 'Unknown', 'Waiter/Waitress'],
, ordered=False)
In [63]:
# number of investor value count
df.Investors.value_counts()
Out[63]:
1      27814
2       1386
3        991
4        827
5        753
       ...  
665        1
634        1
555        1
752        1
754        1
Name: Investors, Length: 751, dtype: int64
In [64]:
# borrower state value count
df.BorrowerState.value_counts()
Out[64]:
CA    14717
TX     6842
NY     6729
FL     6720
IL     5921
GA     5008
OH     4197
MI     3593
VA     3278
NJ     3097
NC     3084
WA     3048
PA     2972
MD     2821
MO     2615
MN     2318
MA     2242
CO     2210
IN     2078
AZ     1901
WI     1842
OR     1817
TN     1737
AL     1679
CT     1627
SC     1122
NV     1090
KS     1062
KY      983
OK      971
LA      954
UT      877
AR      855
MS      787
NE      674
ID      599
NH      551
NM      472
RI      435
HI      409
WV      391
DC      382
MT      330
DE      300
VT      207
AK      200
SD      189
IA      186
WY      150
ME      101
ND       52
Name: BorrowerState, dtype: int64
In [65]:
# borrower state sum of null values
df.BorrowerState.isnull().sum()
Out[65]:
5515
In [66]:
# check data type
df.BorrowerState.dtype
Out[66]:
dtype('O')
In [67]:
# borrower state % of null values
df.BorrowerState.isnull().mean() * 100
Out[67]:
4.840394252964358

In our BORROWER STATE column, we have 5515 values that are null, amounting to 4.84% of key values missing. hence, we have to fill in the null values with a keyword that is identifiable;"UNIDENTIFIED"

We also need to sort out our BORROWER STATE column in a categorical way to ensure that we have a categorical data

In [69]:
# fill missing state with UNIDENTIFIED
df.BorrowerState.fillna('Unidentified', inplace = True)
In [70]:
#Let us check for null values again
df.BorrowerState.isnull().sum()
Out[70]:
0
In [71]:
# Good, lets convert data type to categorical
df.BorrowerState = df.BorrowerState.astype('category')
In [72]:
# check data type
df.BorrowerState.dtype
Out[72]:
CategoricalDtype(categories=['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL',
                  'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA',
                  'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE',
                  'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI',
                  'SC', 'SD', 'TN', 'TX', 'UT', 'Unidentified', 'VA', 'VT',
                  'WA', 'WI', 'WV', 'WY'],
, ordered=False)
In [73]:
#Let us replace the abbreviated state to full state name
state = {'WA': 'WASHINGTON', 'VA': 'VIRGINIA', 'DE': 'DELAWARE', 
         'DC': 'DISTRICT OF COLUMBIA', 'WI': 'WISCONSIN', 'WV': 'WEST VIRGINIA',
         'HI': 'HAWAII', 'FL': 'FLORIDA', 'WY': 'WYOMING', 
         'NH': 'NEW HAMPSHIRE', 'NJ': 'NEW JERSEY', 'NM': 'NEW MEXICO', 
         'TX': 'TEXAS', 'LA': 'LOUISIANA', 'NC': 'NORTH CAROLINA', 
         'ND': 'NORTH DAKOTA', 'NE': 'NEBRASKA', 'TN': 'TENNESSEE',
         'NY': 'NEW YORK', 'PA': 'PENNSYLVANIA', 'CA': 'CALIFORNIA',
         'NV': 'NEVADA', 'CO': 'COLORADO', 'VI': 'VIRGIN ISLANDS', 
         'AK': 'ALASKA', 'AL': 'ALABAMA',  'AR': 'ARKANSAS', 
         'VT': 'VERMONT', 'IL': 'ILLINOIS', 'GA': 'GEORGIA', 
         'IN': 'INDIANA', 'IA': 'IOWA', 'OK': 'OKLAHOMA', 
         'AZ': 'ARIZONA', 'ID': 'IDAHO', 'CT': 'CONNECTICUT', 
         'ME': 'MAINE', 'MD': 'MARYLAND', 'MA': 'MASSACHUSETTS', 
         'OH': 'OHIO', 'UT': 'UTAH', 'MO': 'MISSOURI',
         'MN': 'MINNESOTA', 'MI': 'MICHIGAN', 'RI': 'RHODE ISLAND',
         'KS': 'KANSAS', 'MT': 'MONTANA', 'MS': 'MISSISSIPPI', 
         'SC': 'SOUTH CAROLINA', 'KY': 'KENTUCKY', 'OR': 'OREGON', 
         'SD': 'SOUTH DAKOTA', 'Unknown': 'UNKNOWN' }
In [74]:
# repalace the abbreviated with the dictionary created
df.BorrowerState = df.BorrowerState.map(state)
#Check the Values
df.BorrowerState.value_counts()
Out[74]:
CALIFORNIA              14717
TEXAS                    6842
NEW YORK                 6729
FLORIDA                  6720
ILLINOIS                 5921
GEORGIA                  5008
OHIO                     4197
MICHIGAN                 3593
VIRGINIA                 3278
NEW JERSEY               3097
NORTH CAROLINA           3084
WASHINGTON               3048
PENNSYLVANIA             2972
MARYLAND                 2821
MISSOURI                 2615
MINNESOTA                2318
MASSACHUSETTS            2242
COLORADO                 2210
INDIANA                  2078
ARIZONA                  1901
WISCONSIN                1842
OREGON                   1817
TENNESSEE                1737
ALABAMA                  1679
CONNECTICUT              1627
SOUTH CAROLINA           1122
NEVADA                   1090
KANSAS                   1062
KENTUCKY                  983
OKLAHOMA                  971
LOUISIANA                 954
UTAH                      877
ARKANSAS                  855
MISSISSIPPI               787
NEBRASKA                  674
IDAHO                     599
NEW HAMPSHIRE             551
NEW MEXICO                472
RHODE ISLAND              435
HAWAII                    409
WEST VIRGINIA             391
DISTRICT OF COLUMBIA      382
MONTANA                   330
DELAWARE                  300
VERMONT                   207
ALASKA                    200
SOUTH DAKOTA              189
IOWA                      186
WYOMING                   150
MAINE                     101
NORTH DAKOTA               52
Name: BorrowerState, dtype: int64
In [75]:
#Good,let us move on to the next stage
In [76]:
# structure of the dataframe
df.shape
Out[76]:
(113937, 82)

Our dataset has 113,937 rows and 81 columns. However, we are not going to work with the whole datasets as our observations are not spanning the whole recorded observation. Our Questions for this dataset include but is not limited to; How many homeowners borrow in a state, which year has the highest rate of borrowing homeowners, which state has the highest amount of borrowing homeowners, is there a correlation between your available bank credit and your borrowing capacity, identifying the frequency of the categorical variables; Term of loan, Borrower's employment status, year of loan, and loan status, are there differences between loans depending on the loan term?.

Since we are not working with the whole dataset, let us reduce our dataset and combine the ones we will need into a data sheet

In [77]:
# select the column into a new datasheet
new_df = df[['LoanOriginationDate','Term', 'Year', 'LoanStatus', 'BorrowerState', 'Occupation', 
                     'EmploymentStatus', 'IsBorrowerHomeowner', 'LoanCurrentDaysDelinquent',
                     'IncomeVerifiable', 'StatedMonthlyIncome', 'LoanOriginalAmount', 
                     'MonthlyLoanPayment','Recommendations', 'Investors','AvailableBankcardCredit']]
In [78]:
# save as a csv file
new_df.to_csv('new_df.csv', index = False)
In [79]:
#check if new file is saved
new_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 16 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   LoanOriginationDate        113937 non-null  datetime64[ns]
 1   Term                       113937 non-null  category      
 2   Year                       113937 non-null  category      
 3   LoanStatus                 113937 non-null  category      
 4   BorrowerState              108422 non-null  object        
 5   Occupation                 113937 non-null  category      
 6   EmploymentStatus           113937 non-null  category      
 7   IsBorrowerHomeowner        113937 non-null  bool          
 8   LoanCurrentDaysDelinquent  113937 non-null  int64         
 9   IncomeVerifiable           113937 non-null  bool          
 10  StatedMonthlyIncome        113937 non-null  int32         
 11  LoanOriginalAmount         113937 non-null  int64         
 12  MonthlyLoanPayment         113937 non-null  int32         
 13  Recommendations            113937 non-null  int64         
 14  Investors                  113937 non-null  int64         
 15  AvailableBankcardCredit    106393 non-null  float64       
dtypes: bool(2), category(5), datetime64[ns](1), float64(1), int32(2), int64(4), object(1)
memory usage: 7.7+ MB
In [80]:
new_df.isnull().sum() 
Out[80]:
LoanOriginationDate             0
Term                            0
Year                            0
LoanStatus                      0
BorrowerState                5515
Occupation                      0
EmploymentStatus                0
IsBorrowerHomeowner             0
LoanCurrentDaysDelinquent       0
IncomeVerifiable                0
StatedMonthlyIncome             0
LoanOriginalAmount              0
MonthlyLoanPayment              0
Recommendations                 0
Investors                       0
AvailableBankcardCredit      7544
dtype: int64

So, our borrower state column has 5515 null values and Available Bank card credit too has 7544 missing values.Let us work on that before we continue

In [81]:
new_df.BorrowerState.fillna('Unknown', inplace = True)
C:\Users\mayowa.ebenezer\AppData\Local\Temp\ipykernel_17648\405810248.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df.BorrowerState.fillna('Unknown', inplace = True)
In [84]:
new_df.isnull().sum()
Out[84]:
LoanOriginationDate             0
Term                            0
Year                            0
LoanStatus                      0
BorrowerState                   0
Occupation                      0
EmploymentStatus                0
IsBorrowerHomeowner             0
LoanCurrentDaysDelinquent       0
IncomeVerifiable                0
StatedMonthlyIncome             0
LoanOriginalAmount              0
MonthlyLoanPayment              0
Recommendations                 0
Investors                       0
AvailableBankcardCredit      7544
dtype: int64
In [85]:
#Available Bank Card Credit up next
new_df.AvailableBankcardCredit.fillna('0', inplace = True)
#Check for null values again
new_df.isnull().sum()
C:\Users\mayowa.ebenezer\AppData\Local\Temp\ipykernel_17648\2755186625.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df.AvailableBankcardCredit.fillna('0', inplace = True)
Out[85]:
LoanOriginationDate          0
Term                         0
Year                         0
LoanStatus                   0
BorrowerState                0
Occupation                   0
EmploymentStatus             0
IsBorrowerHomeowner          0
LoanCurrentDaysDelinquent    0
IncomeVerifiable             0
StatedMonthlyIncome          0
LoanOriginalAmount           0
MonthlyLoanPayment           0
Recommendations              0
Investors                    0
AvailableBankcardCredit      0
dtype: int64
In [86]:
# Done, lets move on
# more stat info
new_df.describe()
Out[86]:
LoanCurrentDaysDelinquent StatedMonthlyIncome LoanOriginalAmount MonthlyLoanPayment Recommendations Investors
count 113937.000000 1.139370e+05 113937.00000 113937.000000 113937.000000 113937.000000
mean 152.816539 5.607707e+03 8337.01385 271.978128 0.048027 80.475228
std 466.320254 7.478491e+03 6245.80058 192.701916 0.332353 103.239020
min 0.000000 0.000000e+00 1000.00000 0.000000 0.000000 1.000000
25% 0.000000 3.200000e+03 4000.00000 131.000000 0.000000 2.000000
50% 0.000000 4.666000e+03 6500.00000 217.000000 0.000000 44.000000
75% 0.000000 6.825000e+03 12000.00000 371.000000 0.000000 115.000000
max 2704.000000 1.750002e+06 35000.00000 2251.000000 39.000000 1189.000000
In [87]:
#last ten values (tail)
new_df.tail(10)
Out[87]:
LoanOriginationDate Term Year LoanStatus BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner LoanCurrentDaysDelinquent IncomeVerifiable StatedMonthlyIncome LoanOriginalAmount MonthlyLoanPayment Recommendations Investors AvailableBankcardCredit
113927 2008-05-09 Medium Term 2008 Completed WASHINGTON Executive Full-time True 0 True 10333 4292 132 2 194 34415.0
113928 2011-06-10 Medium Term 2011 Completed COLORADO Other Full-time False 0 True 2333 2000 73 0 25 3746.0
113929 2013-07-10 Medium Term 2013 Completed FLORIDA Accountant/CPA Employed False 0 True 4333 2500 101 0 26 7537.0
113930 2013-07-10 Medium Term 2013 Current INDIANA Professional Employed True 0 True 6250 3000 106 0 52 1322.0
113931 2014-01-22 Long Term 2014 Current ILLINOIS Analyst Employed False 0 True 8146 25000 565 0 1 12056.0
113932 2013-04-22 Medium Term 2013 Current ILLINOIS Food Service Management Employed True 0 True 4333 10000 364 0 1 1886.0
113933 2011-11-07 Medium Term 2011 FinalPaymentInProgress PENNSYLVANIA Professional Employed True 0 True 8041 2000 65 0 22 6658.0
113934 2013-12-23 Long Term 2013 Current TEXAS Other Employed True 0 True 2875 10000 273 0 119 7853.0
113935 2011-11-21 Long Term 2011 Completed GEORGIA Food Service Full-time True 0 True 3875 15000 449 0 274 4137.0
113936 2014-01-21 Medium Term 2014 Current NEW YORK Professor Employed False 0 True 4583 2000 64 0 1 675.0
In [ ]:
 

Univariate Exploration¶

Let us look at the distributions in our data, and show some relationships between them, this should be fun!

In [88]:
#Using Histogram
def chart(x, title):
    """plot histogram to show 
    dist of numeric variable"""
    """param: x, title"""
    """return none"""
    plt.figure(figsize=(12,10), dpi = 400)
    plt.hist(x = x)
    plt.title(title)
    plt.xlabel('Amount in Dollars)', fontsize = 8)
    plt.ylabel('Distribution', fontsize = 8)
chart(new_df.LoanOriginalAmount, 'Histogram Distibution of Loan Original Amount')

The graph above shows us that, the amount borrowed the most lies to the left of our graph, meaning the graph is RIGHT SKEWED, a case of symmetrical distribution. Most of the loaners (those who borrowed) borrowed at about the 5000 dollars column, there is however, an outlier in between the 30000 and 35000 dollars range.**

In [89]:
#Monthly loan payment graph
chart(new_df.MonthlyLoanPayment, 'Distibution of Monthly Loan Payment')

The data in the above graph are right-skewed, a case of symmetrical distribution

Lets explore further, lets explore the probability density function of the data points.

In [93]:
def density(x, title):
    """plot a kernel density estimate"""
    """param: x, title"""
    """return none"""
    plt.figure(figsize=(12,10), dpi = 400)
    sns.kdeplot(x = x, data = new_df, fill = True)
    plt.xlabel('Amount(Dollars)')
    plt.title(title)
In [95]:
density('LoanOriginalAmount', 'Density Estimate for Loan Original Amount')

The probability that a randomly chosen loan original amount will fall between 5000 dollars and 12000 dollars can be calculated as the area between the density function (graph) and the x-axis in the interval [5000, 12000].

In [96]:
density('MonthlyLoanPayment', 'Density Estimate for Monthly Loan Payment')

The probability that a randomly chosen monthly loan payment will fall between 300 dollars and 500 dollars can be calculated as the area between the density function (graph) and the x-axis in the interval [300, 500]

Let us Plot further on our data frame, to check for more relationships

In [100]:
def univariate_plot(x, title):
    """plot a countplot"""
    """param: x, title"""
    """return none"""
    plt.figure(figsize=(8,6), dpi = 400)
    ax = sns.countplot(x = x, data = new_df)
    for p in ax.patches:
      ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0, p.get_height()+0.05))
    plt.suptitle(title)
    plt.xticks(rotation = 90)
    plt.plot()   
plt.show();
In [102]:
#Loan Term Distribution
univariate_plot('Term','Term of Loan Distribution.')

Loans Disbursed with a medium time frame (36 months) has the highest occurence, followed by the ones disbursed within the long term frame, if this should mean anything, it is that, people generally prefer medium to long term loan as opposed to short term loans that might destabilize their income.

In [103]:
univariate_plot('EmploymentStatus', "Employment Status Distribution.")

From the above graph, we can infer that those who are employed are the highest borrowers, does this mean the emploees are not getting enough money that suits their lifestlye? or the loan company prefers giving out loans to working class individuals?

In [104]:
univariate_plot('Year', 'Loan Distribution by Year.')

Year 2013 spiked up in number of loan disbursed, before dropping back drastically in 2014 what might cause this? Economic crisis? Inflation?

In [106]:
univariate_plot('IsBorrowerHomeowner', "Borrower's Home Owner Distribution.")

The difference between a homeowner and a non homeowner who borrowed is not that specific, but its worth noting that we still tried to find out

In [107]:
univariate_plot('IncomeVerifiable', "Borrower's Verifiable Income Distribution.")

The loan company made sure to not borrow to users whose income is not verifiable, good idea right? No one wants to run into a loss.

In [108]:
univariate_plot('LoanStatus', 'Loan Status Distribution.')

Few borrowers cancelled their loans, majority of the borrowers has current loans, while the second largest distribution are for those who have current loans.

In [120]:
sns.countplot(x='BorrowerState', data=new_df, order=new_df.BorrowerState.value_counts().iloc[:10].index)
plt.xticks(rotation=90)
Out[120]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 [Text(0, 0, 'CALIFORNIA'),
  Text(1, 0, 'TEXAS'),
  Text(2, 0, 'NEW YORK'),
  Text(3, 0, 'FLORIDA'),
  Text(4, 0, 'ILLINOIS'),
  Text(5, 0, 'Unknown'),
  Text(6, 0, 'GEORGIA'),
  Text(7, 0, 'OHIO'),
  Text(8, 0, 'MICHIGAN'),
  Text(9, 0, 'VIRGINIA')])

The plot above shows the top ten(10) borrowers state(s) in the country. California is the top borrowing state followed by Texas.

In [125]:
sns.countplot(x='BorrowerState', data=new_df, order=new_df.BorrowerState.value_counts().index[::-10])
plt.xticks(rotation=90)
Out[125]:
(array([0, 1, 2, 3, 4, 5]),
 [Text(0, 0, 'NORTH DAKOTA'),
  Text(1, 0, 'WEST VIRGINIA'),
  Text(2, 0, 'LOUISIANA'),
  Text(3, 0, 'WISCONSIN'),
  Text(4, 0, 'NORTH CAROLINA'),
  Text(5, 0, 'TEXAS')])

From the above chart, North Dakota has the lowest rate of borrowing followed by West Virginia.

Let us Check for occupation of borrower and see the top and bottom (10)

In [127]:
sns.countplot(x='Occupation', data=new_df, order=new_df.Occupation.value_counts().iloc[:10].index)
plt.xticks(rotation=90)
Out[127]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 [Text(0, 0, 'Other'),
  Text(1, 0, 'Professional'),
  Text(2, 0, 'Computer Programmer'),
  Text(3, 0, 'Executive'),
  Text(4, 0, 'Teacher'),
  Text(5, 0, 'Administrative Assistant'),
  Text(6, 0, 'Analyst'),
  Text(7, 0, 'Unknown'),
  Text(8, 0, 'Sales - Commission'),
  Text(9, 0, 'Accountant/CPA')])

Seems most of the borrowers didnt disclose their occuaptions, anonymity perhaps?

In [129]:
sns.countplot(x='Occupation', data=new_df, order=new_df.Occupation.value_counts().index[::-10])
plt.xticks(rotation=90)
Out[129]:
(array([0, 1, 2, 3, 4, 5, 6]),
 [Text(0, 0, 'Student - Technical School'),
  Text(1, 0, 'Flight Attendant'),
  Text(2, 0, 'Investor'),
  Text(3, 0, 'Fireman'),
  Text(4, 0, 'Tradesman - Mechanic'),
  Text(5, 0, 'Laborer'),
  Text(6, 0, 'Unknown')])

Student-Technical School is the least borrowing occupation.

Extra: For more clarity, let us create a word cloud count for the Occupation

In [141]:
#import word cloud
from wordcloud import WordCloud
occupation_text = " ".join(i for i in new_df['Occupation'])
#Calling a function to create the word cloud
def draw_word_cloud(text, title):
    """Plot a word cloud """
    """param: text, title"""
    """return none"""
    wordcloud = WordCloud(background_color="white", max_font_size=300, width=1024, height=1000, colormap="magma").generate_from_text(text)
    plt.figure(figsize=(20,20))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.margins(x=0, y=0)
    plt.suptitle(title, fontsize=30)
    plt.show();
    
draw_word_cloud(occupation_text, "Borrower's Occupation Word Cloud.")

In the Image in the word cloud above, we can see that Professionals and Computer Programmer are the highest loan borrowers.

Let us check for outliers in our data, using the percentile distribution Here, we will be usig the Loan Original Amount, Monthly loan Payment, Monthly income column(s) etc. to ascertain our values.

In [145]:
#using a function for these plots to enable fast and easy analysis
#of our distribution(s)

def univariate_boxplot(x, title):
    """plot a univariate boxplot"""
    """param:x, title"""
    """return none"""
    plt.figure(figsize=(8,6),dpi = 400)
    sns.boxplot(x = x, data = new_df)
    plt.suptitle(title)
    plt.show();
In [146]:
#for our Loan Original Amount data
univariate_boxplot('LoanOriginalAmount', 'Loan Original Amount Distribution.')

The 25th percentile of the loan original amount variable falls slightly before 5000 dollars, while the median falls above the 5000 dollars, the 75th percentile falls short of 15000 dollars.

In [147]:
# for our monthly loan payment distribution
univariate_boxplot('MonthlyLoanPayment', 'Monthly Loan Payment Distribution.')

The 25th percentile, median and 75th percentile of the monthly loan payment variable falls short of 500 dollars.

In [149]:
# Inevestors distribution
univariate_boxplot('Investors', 'Investors Distribution.')

The 25th percentile, median and 75th percentile of the investors variable falls short of 200 investors.

In [150]:
#Monthly Income
univariate_boxplot('StatedMonthlyIncome', 'Stated Monthly Income Distribution.')

The figure above depicts the distribution of the stated monthly income with outliers indicated with an astericks, due to the bulkiness of the data the first quartile and the rest cannot be visualized from this figure.

SUMMARY¶

To analyse the loan with respect the year, the loan origination date column was converted from object datatype to datetime, afterwards the year was extracted from the datetime before setting the data type of the extracted year column as categorical variable, also the loan term values was trasform from the original values 12months, 36months, and 60months to short term, medium term and long term respectively, it was also changed to a categorical data type to enable easy analyzation. The loan status has values respresenting past due in a number of categories of days, these values were replaced with a single value named 'past due' regardless of the number of days. The borrower state values were transformed from their abbreviated text to full text to enable easy comprehension by non native researchers, we also did not leave out the stated monthy income and monthy loan payment variable out. These variables were converted from float to integer for consistency with the loan amount data type. The occupation column was transformed from object data type to categorical data type.

For the distribution of term, loans disbursed on the medium term has the highest occurence, Employed individuals borrowed the most, followed by the full time category. Year 2013 had the highest number of loan disbursement, followed by the year 2012 and 2014 respectively at second and third position, the least loan disbursement occured in the year 2005 with a occurence of 22 loan disbursement. There was not much difference between if a borrower is a home owner or not and loans were only disbursed to those whose income were verifiable. The distibution of the loan original amount is right-skewed, a case of symmetrical distribution. Most of the loan original amount are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 5000 dollars, there are outliers in the ranges of 32000 dollars and 35000 dollars. The distribution of monthly loan payment is right-skewed, a case of symmetrical distribution. Most of the monthly loan payment are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 173 dollars.

Bivariate Exploration¶

AIM:To identify the relativity of a categorical variable against another categorical variable, this would indicating how much of a categorical variable spread over another categorical variable.

In [151]:
# a func to plot a bivariate countplot
def bivariate_count_plot(x, hue, title):
    """plot a countplot"""
    """param: x, hue title"""
    """return none"""
    plt.figure(figsize=(8,6), dpi = 400)
    ax = sns.countplot(x = x, hue = hue, data = new_df)
    for p in ax.patches:
      ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.15, p.get_height()+0.05))
    plt.suptitle(title)
    plt.xticks(rotation = 90)
    ax.margins(y=0.1)
    plt.xlabel('Loan Status', fontsize = 15)
    plt.tight_layout()
    plt.plot()   
plt.show();
In [154]:
# Function to plot laon status by IsBorrower a home owner or not
bivariate_count_plot('LoanStatus', 'IsBorrowerHomeowner', 'Loan Status by Home Owner status.')

I wanted to check if a borrower status as a home owner will affect the loan Status as being paid or not but found out, it did not really affect it. The current loan status; the highest, shows that home owners are borrowers too

In [155]:
# Function to plot laon status by term
bivariate_count_plot('LoanStatus', 'Term', 'Loan Status by Term Duration.')

The medium term has the highest occurence of loan duration in the distribution from earlier findings, how much does this spread over the loan status category is what the figure above depicts, for the current loan status 20127 were on long term, 36387 were on medium term while just 62 were on the short term. The past due category has a spread of 684 long term duration, 1373 medium term duration, and 10 short term duration. other categories can be visualized from the figure above.

In [162]:
bivariate_count_plot('IncomeVerifiable', 'IsBorrowerHomeowner', 'Is Homeowner(s) Income Verifiable?')

From the above plot, Home owners whose occupations were verifiable were 54073 and those whose werent are 3405. Meaning a larger percentage of home owners also has their income verifiable.

In [164]:
# Function to plot laon status by home ownership
bivariate_count_plot('LoanStatus', 'IsBorrowerHomeowner', 'Loan Status by Borrower Home Ownership.')

The current loan status 30478 are home owners, while 26098 were not home owners. The past due category has a spread of 1003 home ownership, and 1064 no home ownership. The completed loan status category has home ownership of 18280 while 19794 does not own a home.

In [165]:
#Function to plot loan status by Verifiable Income
bivariate_count_plot('LoanStatus', 'IncomeVerifiable', 'Loan Status by Verifiable Income')

The current loan status 52434 has true (verifiable income) , while 4142 were false(no verifiable income). The past due category has 1857 verifiable income borrowers, and 210 non income verifiable borrowers. The completed loan status category has verifiable income borrowers of 35292 while 2792 does not have a verifiable income.

Lets evaluate the mean distribution of loan original amount and stated monthly income group by categorical variables year, loan status, and verifiable income.

In [166]:
# Function to plot Loan Original Amount and 
# Stated Monthly Income group by a feature
def bi_bar_plot(by, groupby, title):
    """plot a barplot on groupby function"""
    """param: by, groupby, title"""
    """return ax, mean"""
    ax = new_df.groupby(by= by)[['LoanOriginalAmount', 'StatedMonthlyIncome']].mean().plot(kind = 'bar', figsize = (26,12), fontsize = 16)
    mean = new_df.groupby(by = groupby).mean()[['LoanOriginalAmount', 'StatedMonthlyIncome']]
    plt.xticks(rotation = 360)
    plt.title(title, fontsize = 25)
    plt.ylabel('Mean Values', fontsize = 20)
    plt.xlabel(by, fontsize = 20)
    return ax, mean;
In [167]:
# Function to group by year
bi_bar_plot('Year', 'Year', 'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Year.')
Out[167]:
(<AxesSubplot:title={'center':'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Year.'}, xlabel='Year', ylabel='Mean Values'>,
       LoanOriginalAmount  StatedMonthlyIncome
 Year                                         
 2005         3576.681818         11122.727273
 2006         4763.325262          4743.353708
 2007         7049.545026          4653.942321
 2008         6021.628289          4618.605869
 2009         4354.858818          5091.406448
 2010         4766.540340          5290.192675
 2011         6692.021108          5659.263359
 2012         7833.842173          5709.352376
 2013        10545.065599          6156.148231
 2014        11912.219520          6329.746796)

The loan original amount has the highest mean value in the year 2014 with a lowest mean value in the year 2005, for the stated monthly income the year 2005 has the highest mean value while year 2008 has the lowest mean value.

In [168]:
# Function to group by loan status
bi_bar_plot('LoanStatus', 'LoanStatus', 'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Loan Status.')
Out[168]:
(<AxesSubplot:title={'center':'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Loan Status.'}, xlabel='LoanStatus', ylabel='Mean Values'>,
                         LoanOriginalAmount  StatedMonthlyIncome
 LoanStatus                                                     
 Cancelled                      1700.000000          2608.600000
 Chargedoff                     6398.916694          4485.688793
 Completed                      6189.093239          5324.206204
 Current                       10360.835018          6152.938667
 Defaulted                      6486.798525          4366.463930
 FinalPaymentInProgress         8346.121951          6311.419512
 Past Due                       8258.437349          5366.540397)

The loan original amount has the highest mean value in the current loan status category and a lowest mean value in the cancelled loan category, for the stated monthly income the loan status final payment in progress has the highest mean value with loan status cancelled with the lowest mean values.

In [169]:
# Function to group by verifiable income
bi_bar_plot('IncomeVerifiable', 'IncomeVerifiable', 'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Verifiable Income.')
Out[169]:
(<AxesSubplot:title={'center':'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Verifiable Income.'}, xlabel='IncomeVerifiable', ylabel='Mean Values'>,
                   LoanOriginalAmount  StatedMonthlyIncome
 IncomeVerifiable                                         
 False                    7111.896413          5022.028838
 True                     8437.904368          5655.938519)

In the stated monthly income variable, the highest mean value is the true value of verifiable income variable while the lowest mean value is the false value of verifiable income variable.

The stated monthly income variable the highest mean value is true value of verifiable income variable with a lowest mean value in the false value of verifiable income variable.

Lets establish any relationship or correlation between the continuous numerical variables; stated monthly income, loan original amount, and monthly loan payment.

In [170]:
# define a function to plot lineplot
def line_plot(x, y, title):
    """plot a line plot"""
    """param: x, y, title"""
    """return none"""
    plt.figure(figsize=(10,8), dpi= 450)
    p = sns.lineplot(x = x, y = y, data = new_df)
    plt.title(title); 
In [171]:
# call func to plot lineplot of stated monthly income against loan original amount
line_plot('StatedMonthlyIncome', 'LoanOriginalAmount', 'Line Plot Depicting Relatioship Between Stated Monthly Income and Loan Original Amount.')

There is no positive correlation between the two variables as the figure above shows that majority of the values fall between 0 and 35000 dollars

In [172]:
#Function to depict lineplot of monthly loan payment against original loan amount
line_plot('LoanOriginalAmount', 'MonthlyLoanPayment', 'Line Plot Depicting Relationship Between Monthly Loan Payment and Loan Original Amount.')

As the original loan amount increases the monthly loan payment also increases, meaning there is a rush to pay off the loan as quickly as possible which means, there is a positive correlation between our two variables.

In [183]:
line_plot('Year', 'Investors', 'Line Plot Depicting Relationship Between Years and Investors.')

From year 2005 to mid-2008 and 2010, there was a shar increase in investors invloved in the loan business, but this dropped from mid 2008 to 2014.

SUMMARY

I wanted to check if a borrower status as a home owner will affect the loan Status as being paid or not but found out, it did not really affect it. The current loan status; the highest, shows that home owners are borrowers too

The medium term has the highest occurence of loan duration in the distribution from earlier findings, how much does this spread over the loan status category is what the figure above depicts, for the current loan status 20127 were on long term, 36387 were on medium term while just 62 were on the short term. The past due category has a spread of 684 long term duration, 1373 medium term duration, and 10 short term duration. other categories can be visualized from the figure above.

Home owners whose occupations were verifiable were 54073 and those whose werent are 3405. Meaning a larger percentage of home owners also has their income verifiable.

The current loan status 30478 are home owners, while 26098 were not home owners. The past due category has a spread of 1003 home ownership, and 1064 no home ownership. The completed loan status category has home ownership of 18280 while 19794 does not own a home.

The loan original amount has the highest mean value in the year 2014 with a lowest mean value in the year 2005, for the stated monthly income the year 2005 has the highest mean value while year 2008 has the lowest mean value.

The loan original amount has the highest mean value in the current loan status category and a lowest mean value in the cancelled loan category, for the stated monthly income the loan status final payment in progress has the highest mean value with loan status cancelled with the lowest mean values.

There is no positive correlation between stated monthly income and loan original amount

As the original loan amount increases the monthly loan payment also increases, meaning there is a rush to pay off the loan as quickly as possible which means, there is a positive correlation between our two variables.

From year 2005 to mid-2008 and 2010, there was a shar increase in investors invloved in the loan business, but this dropped from mid 2008 to 2014.

MULTIVARIATE EXPLORATION¶

AIM: showcase the relationship between three variables, two continous numerical variable and a categorical variable.

In [184]:
# Function to plot scatter plot loan original amount against current days of delinquency 
def scatter(hue, title):
    """plot a scatterplot"""
    """param: hue, title"""
    """return none"""
    plt.figure(figsize=(12,10), dpi = 450)
    sns.scatterplot(x = 'LoanOriginalAmount', y = 'MonthlyLoanPayment', hue = hue, data = new_df)
    plt.xlabel('Loan Original Amount(Dollars)')
    plt.ylabel('Monthly Loan Payment(Dollars)')
    plt.title(title);
In [185]:
#Function to PLot Scatter plot on group by loan term
scatter('Term', 'Original Loan Amount Against Current Days of Delinquency Group by Loan Term.')

There exists a positive relationship between the loan original amount and monthly loan payment, the data point(s) are spread across the plot categorized by term of loan.

In [186]:
#Function to PLot Scatter plot by year
scatter('Year', 'Original Loan Amount Against Current Days of Delinquency Group by Loan Year.')

From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by year of loan.

In [187]:
#Function to plot scatterplot on group by verifiable income
scatter('IncomeVerifiable', 'Original Loan Amount Against Current Days of Delinquency Group by Verifiable Income.')

From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by verifiable income.

In [188]:
# Function to plot scatter plot group by borrower home ownership status
scatter('IsBorrowerHomeowner', 'Original Loan Amount Against Current Days of Delinquency Group by Borrower Home Ownership.')

From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by home ownership.

In [190]:
# Function to plot scatter plot on group by employment status
scatter('EmploymentStatus', 'Original Loan Amount Against Current Days of Delinquency Group Employment Status.')

From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by employment status.

In [192]:
# Function to plot scatter plot on group by loan status
scatter('LoanStatus', 'Original Loan Amount Against Current Days of Delinquency Group Loan Status.')

From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across the plot categorized by loan status.

Let us check for the correlation between numerical variables based on their linear properties by plotting a heatmap.

In [193]:
plt.figure(figsize=(10,8), dpi = 400)
colormap = sns.color_palette('Greens')
sns.heatmap(new_df.corr(), annot = True, cmap = colormap, center = 0)
plt.title('Correlation Matrix Depicting Relationship Between Variable with Heatmap.')
Out[193]:
Text(0.5, 1.0, 'Correlation Matrix Depicting Relationship Between Variable with Heatmap.')

We can deduce from the heatmap that, we have a positive correlation between loan original amount and monthly loan payment with a correlation coefficient of 0.93, the stated monthly income and the monthly loan payment seems to have no correlation between them as well as the income verifiable and the Loans Current Days deliquent column.

SUMMARY¶

From earlier findings there exist a positive relationship between the loan original amount and monthly loan payment, the data point are spread across a scatterplot categorized by term of loan, year, verifiable income, home ownership, loan status, and employment status plotted on different scatter plot to better depict the spread. We could also deduce from the findings the correlation of continuous numerical variable by the heatmap, we could see we have a positive correlation between loan original amount and monthly loan payment with a correlation coefficient of 0.93, the stated monthly income and the original loan amount seems to have no correlation between them as well as the income verifiable and the Loans Current Days deliquent column.

CONCLUSIONS¶

To analyse the loan with respect the year, the loan origination date column was converted from object datatype to datetime, afterwards the year was extracted from the datetime before setting the data type of the extracted year column as categorical variable, also the loan term values was trasform from the original values 12months, 36months, and 60months to short term, medium term and long term respectively, it was also changed to a categorical data type to enable easy analyzation. The loan status has values respresenting past due in a number of categories of days, these values were replaced with a single value named 'past due' regardless of the number of days. The borrower state values were transformed from their abbreviated text to full text to enable easy comprehension by non native researchers, we also did not leave out the stated monthy income and monthy loan payment variable out. These variables were converted from float to integer for consistency with the loan amount data type. The occupation column was transformed from object data type to categorical data type.

For the distribution of term, loans disbursed on the medium term has the highest occurence, Employed individuals borrowed the most, followed by the full time category. Year 2013 had the highest number of loan disbursement, followed by the year 2012 and 2014 respectively at second and third position, the least loan disbursement occured in the year 2005 with a occurence of 22 loan disbursement. There was not much difference between if a borrower is a home owner or not and loans were only disbursed to those whose income were verifiable. The distibution of the loan original amount is right-skewed, a case of symmetrical distribution. Most of the loan original amount are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 5000 dollars, there are outliers in the ranges of 32000 dollars and 35000 dollars. The distribution of monthly loan payment is right-skewed, a case of symmetrical distribution. Most of the monthly loan payment are clustered on the left side of the histogram. The peak of the original loan amount occurs at about 173 dollars.

I wanted to check if a borrower status as a home owner will affect the loan Status as being paid or not but found out, it did not really affect it. The current loan status; the highest, shows that home owners are borrowers too. The medium term has the highest occurence of loan duration in the distribution from earlier findings, how much does this spread over the loan status category is what the figure above depicts, for the current loan status 20127 were on long term, 36387 were on medium term while just 62 were on the short term. The past due category has a spread of 684 long term duration, 1373 medium term duration, and 10 short term duration. other categories can be visualized from the figure above. Home owners whose occupations were verifiable were 54073 and those whose werent are 3405. Meaning a larger percentage of home owners also has their income verifiable.The current loan status 30478 are home owners, while 26098 were not home owners. The past due category has a spread of 1003 home ownership, and 1064 no home ownership. The completed loan status category has home ownership of 18280 while 19794 does not own a home.The loan original amount has the highest mean value in the year 2014 with a lowest mean value in the year 2005, for the stated monthly income the year 2005 has the highest mean value while year 2008 has the lowest mean value.The loan original amount has the highest mean value in the current loan status category and a lowest mean value in the cancelled loan category, for the stated monthly income the loan status final payment in progress has the highest mean value with loan status cancelled with the lowest mean values.There is no positive correlation between stated monthly income and loan original amount, as the original loan amount increases the monthly loan payment also increases, meaning there is a rush to pay off the loan as quickly as possible which means, there is a positive correlation between our two variables. From year 2005 to mid-2008 and 2010, there was a shar increase in investors invloved in the loan business, but this dropped from mid 2008 to 2014.

Lastly, From earlier findings there existed a positive relationship between the loan original amount and monthly loan payment, the data point are spread across a scatterplot categorized by term of loan, year, verifiable income, home ownership, loan status, and employment status plotted on different scatter plot to better depict the spread. We could also deduce from the findings the correlation of continuous numerical variable by the heatmap, we could see we have a positive correlation between loan original amount and monthly loan payment with a correlation coefficient of 0.93, the stated monthly income and the original loan amount seems to have no correlation between them as well as the income verifiable and the Loans Current Days deliquent column. I just had to make sure it was correct by crosschecking using the Multivariate Exploration option

In [ ]: